Face detection

ABSTRACT

Face detection apparatus in which an image region of a test image is compared with data indicative of the presence of a face comprises: a pre-processor operable to identify low-difference regions of the test image where there exists less than a threshold image difference across groups of pixels within those regions; and a face detector operable to perform face detection on regions of the test image other than those identified by the pre-processor as low-difference regions.

This invention relates to face detection.

Many human-face detection algorithms have been proposed in the literature, including the use of so-called eigenfaces, face template matching, deformable template matching or neural network classification. None of these is perfect, and each generally has associated advantages and disadvantages. None gives an absolutely reliable indication that an image contains a face; on the contrary, they are all based upon a probabilistic assessment, based on a mathematical analysis of the image, of whether the image has at least a certain likelihood of containing a face. Depending on their application, the algorithms generally have the threshold likelihood value set quite high, to try to avoid false detections of faces.

In any sort of block-based analysis of a possible face, or an analysis involving a comparison between the possible face and some pre-derived data indicative of the presence of a face, there is a possibility that the algorithm will be confused by an image region which, while possibly looking nothing like a face, may possess certain image attributes to pass the comparison test. Such a region may then be assigned a high probability of containing a face, and can lead to a false-positive face detection.

It is a constant aim in this technical field to improve the reliability of face detection, including reducing the occurrence of false-positive detections.

This invention provides face detection apparatus in which an image region of a test image is compared with data indicative of the presence of a face; the apparatus comprising:

a pre-processor operable to identify low-difference regions of the test image where there exists less than a threshold image difference across groups of pixels within those regions; and

a face detector operable to perform face detection on regions of the test image other than those identified by the pre-processor as low-difference regions.

The invention involves the unexpected recognition that areas of an image having very little detail, or difference between adjacent pixels, may in fact confuse the face recognition process into falsely detecting a face in those image areas.

Accordingly, in a broadest aspect of the invention any such parts of a region under test, where there exists less than a threshold luminance difference between adjacent pixels, are excluded from the face detection process. This can lead to a more reliable detection of faces in the image.

Various further respective aspects and features of the invention are defined in the appended claims.

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, throughout which like parts are defined by like numerals, and in which:

FIG. 1 is a schematic diagram of a general purpose computer system for use as a face detection system and/or a non-linear editing system;

FIG. 2 is a schematic diagram of a video camera-recorder (camcorder) using face detection;

FIG. 3 is a schematic diagram illustrating a training process;

FIG. 4 is a schematic diagram illustrating a detection process;

FIG. 5 schematically illustrates a feature histogram;

FIG. 6 schematically illustrates a sampling process to generate eigenblocks;

FIGS. 7 and 8 schematically illustrate sets of eigenblocks;

FIG. 9 schematically illustrates a process to build a histogram representing a block position;

FIG. 10 schematically illustrates the generation of a histogram bin number;

FIG. 11 schematically illustrates the calculation of a face probability;

FIGS. 12 a to 12 f are schematic examples of histograms generated using the above methods;

FIGS. 13 a to 13 g schematically illustrate so-called multiscale face detection;

FIG. 14 schematically illustrates a face tracking algorithm;

FIGS. 15 a and 15 b schematically illustrate the derivation of a search area used for skin colour detection;

FIG. 16 schematically illustrates a mask applied to skin colour detection;

FIGS. 17 a to 17 c schematically illustrate the use of the mask of FIG. 16;

FIG. 18 is a schematic distance map;

FIGS. 19 a to 19 c schematically illustrate the use of face tracking when applied to a video scene;

FIG. 20 schematically illustrates a display screen of a non-linear editing system;

FIGS. 21 a and 21 b schematically illustrate clip icons;

FIGS. 22 a to 22 c schematically illustrate a gradient pre-processing technique;

FIG. 23 schematically illustrates a video conferencing system;

FIGS. 24 and 25 schematically illustrate a video conferencing system in greater detail;

FIG. 26 is a flowchart schematically illustrating one mode of operation of the system of FIGS. 23 to 25;

FIGS. 27 a and 27 b are example images relating to the flowchart of FIG. 26;

FIG. 28 is a flowchart schematically illustrating another mode of operation of the system of FIGS. 23 to 25;

FIGS. 29 and 30 are example images relating to the flowchart of FIG. 28;

FIG. 31 is a flowchart schematically illustrating another mode of operation of the system of FIGS. 23 to 25;

FIG. 32 is an example image relating to the flowchart of FIG. 31; and

FIGS. 33 and 34 are flowcharts schematically illustrating further modes of operation of the system of FIGS. 23 to 25.

FIG. 1 is a schematic diagram of a general purpose computer system for use as a face detection system and/or a non-linear editing system. The computer system comprises a processing unit 10 having (amongst other conventional components) a central processing unit (CPU) 20, memory such as a random access memory (RAM) 30 and non-volatile storage such as a disc drive 40. The computer system may be connected to a network 50 such as a local area network or the Internet (or both). A keyboard 60, mouse or other user input device 70 and display screen 80 are also provided. The skilled man will appreciate that a general purpose computer system may include many other conventional parts which need not be described here.

FIG. 2 is a schematic diagram of a video camera-recorder (camcorder) using face detection. The camcorder 100 comprises a lens 110 which focuses an image onto a charge coupled device (CCD) image capture device 120. The resulting image in electronic form is processed by image processing logic 130 for recording on a recording medium such as a tape cassette 140. The images captured by the device 120 are also displayed on a user display 150 which may be viewed through an eyepiece 160.

To capture sounds associated with the images, one or more microphones are used. These may be external microphones, in the sense that they are connected to the camcorder by a flexible cable, or may be mounted on the camcorder body itself. Analogue audio signals from the microphone(s) are processed by an audio processing arrangement 170 to produce appropriate audio signals for recording on the storage medium 140.

It is noted that the video and audio signals may be recorded on the storage medium 140 in either digital form or analogue form, or even in both forms. Thus, the image processing arrangement 130 and the audio processing arrangement 170 may include a stage of analogue to digital conversion.

The camcorder user is able to control aspects of the lens 110's performance by user controls 180 which influence a lens control arrangement 190 to send electrical control signals 200 to the lens 110. Typically, attributes such as focus and zoom are controlled in this way, but the lens aperture or other attributes may also be controlled by the user.

Two further user controls are schematically illustrated. A push button 210 is provided to initiate and stop recording onto the recording medium 140. For example, one push of the control 210 may start recording and another push may stop recording, or the control may need to be held in a pushed state for recording to take place, or one push may start recording for a certain timed period, for example five seconds. In any of these arrangements, it is technologically very straightforward to establish from the camcorder's record operation where the beginning and end of each “shot” (continuous period of recording) occurs.

The other user control shown schematically in FIG. 2 is a “good shot marker” (GSM) 220, which may be operated by the user to cause “metadata” (associated data) to be stored in connection with the video and audio material on the recording medium 140, indicating that this particular shot was subjectively considered by the operator to be “good” in some respect (for example, the actors performed particularly well; the news reporter pronounced each word correctly, and so on).

The metadata may be recorded in some spare capacity (e.g. “user data”) on the recording medium 140, depending on the particular format and standard in use. Alternatively, the metadata can be stored on a separate storage medium such as a removable MemoryStick® memory (not shown), or the metadata could be stored on an external database (not shown), for example being communicated to such a database by a wireless link (not shown). The metadata can include not only the GSM information but also shot boundaries, lens attributes, alphanumeric information input by a user (e.g. on a keyboard—not shown), geographical position information from a global positioning system receiver (not shown) and so on.

So far, the description has covered a metadata-enabled camcorder. Now, the way in which face detection may be applied to such a camcorder will be described.

The camcorder includes a face detector arrangement 230. Appropriate arrangements will be described in much greater detail below, but for this part of the description it is sufficient to say that the face detector arrangement 230 receives images from the image processing arrangement 130 and detects, or attempts to detect, whether such images contain one or more faces. The face detector may output face detection data which could be in the form of a “yes/no” flag or may be more detailed in that the data could include the image co-ordinates of the faces, such as the co-ordinates of eye positions within each detected face. This information may be treated as another type of metadata and stored in any of the other formats described above.

As described below, face detection may be assisted by using other types of metadata within the detection process. For example, the face detector 230 receives a control signal from the lens control arrangement 190 to indicate the current focus and zoom settings of the lens 110. These can assist the face detector by giving an initial indication of the expected image size of any faces that may be present in the foreground of the image. In this regard, it is noted that the focus and zoom settings between them define the expected separation between the camcorder 100 and a person being filmed, and also the magnification of the lens 110. From these two attributes, based upon an average face size, it is possible to calculate the expected size (in pixels) of a face in the resulting image data.

A conventional (known) speech detector 240 receives audio information from the audio processing arrangement 170 and detects the presence of speech in such audio information. The presence of speech may be an indicator that the likelihood of a face being present in the corresponding images is higher than if no speech is detected.

Finally, the GSM information 220 and shot information (from the control 210) are supplied to the face detector 230, to indicate shot boundaries and those shots considered to be most useful by the user.

Of course, if the camcorder is based upon the analogue recording technique, further analogue to digital converters (ADCs) may be required to handle the image and audio information.

The present embodiment uses a face detection technique arranged as two phases. FIG. 3 is a schematic diagram illustrating a training phase, and FIG. 4 is a schematic diagram illustrating a detection phase.

Unlike some previously proposed face detection methods (see References 4 and 5 below), the present method is based on modelling the face in parts instead of as a whole. The parts can either be blocks centred over the assumed positions of the facial features (so-called “selective sampling”) or blocks sampled at regular intervals over the face (so-called “regular sampling”). The present description will cover primarily regular sampling, as this was found in empirical tests to give the better results.

In the training phase, an analysis process is applied to a set of images known to contain faces, and (optionally) another set of images (“nonface images”) known not to contain faces. The analysis process builds a mathematical model of facial and nonfacial features, against which a test image can later be compared (in the detection phase).

So, to build the mathematical model (the training process 310 of FIG. 3), the basic steps are as follows:

1. From a set 300 of face images normalised to have the same eye positions, each face is sampled regularly into small blocks.

2. Attributes are calculated for each block; these attributes are explained further below.

3. The attributes are quantised to a manageable number of different values.

4. The quantised attributes are then combined to generate a single quantised value in respect of that block position.

5. The single quantised value is then recorded as an entry in a histogram, such as the schematic histogram of FIG. 5. The collective histogram information 320 in respect of all of the block positions in all of the training images forms the foundation of the mathematical model of the facial features.

One such histogram is prepared for each possible block position, by repeating the above steps in respect of a large number of test face images. The test data are described further in Appendix A below. So, in a system which uses an array of 8×8 blocks, 64 histograms are prepared. In a later part of the processing, a test quantised attribute is compared with the histogram data; the fact that a whole histogram is used to model the data means that no assumptions have to be made about whether it follows a parameterised distribution, e.g. Gaussian or otherwise. To save data storage space (if needed), histograms which are similar can be merged so that the same histogram can be reused for different block positions.
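By way of illustration only, the following sketch shows how the per-position histograms might be accumulated in software. It assumes normalised 64×64 face images, a block size and spacing matching the examples given later, and a hypothetical helper bin_number() (sketched in a later section) performing the attribute calculation and quantisation of steps 2 to 4; none of these names or values are mandated by the method itself.

```python
import numpy as np

BLOCK = 16     # block size in pixels (16x16 blocks, as in the examples below)
SPACING = 16   # block spacing; 16 gives a 4x4 grid of positions in a 64x64 face
A = 4          # number of eigenblocks used (illustrative assumption)
L = 8          # quantisation levels per attribute (illustrative assumption)

def train_histograms(face_images, eigenblocks):
    """Build one attribute histogram per block position from normalised 64x64 face images."""
    positions = [(y, x) for y in range(0, 64 - BLOCK + 1, SPACING)
                        for x in range(0, 64 - BLOCK + 1, SPACING)]
    histograms = {pos: np.zeros(L ** A, dtype=np.int64) for pos in positions}
    for img in face_images:                      # steps 1 to 5 repeated for every training face
        for (y, x) in positions:
            block = img[y:y + BLOCK, x:x + BLOCK]
            h = bin_number(block, eigenblocks, L)  # attribute calculation + quantisation
            histograms[(y, x)][h] += 1             # record the quantised value as a histogram entry
    return histograms
```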

In the detection phase, to apply the face detector to a test image 350, successive windows in the test image are processed 340 as follows:

6. The window is sampled regularly as a series of blocks, and attributes in respect of each block are calculated and quantised as in stages 1-4 above.

7. Corresponding “probabilities” for the quantised attribute values for each block position are looked up from the corresponding histograms. That is to say, for each block position, a respective quantised attribute is generated and is compared with a histogram previously generated in respect of that block position. The way in which the histograms give rise to “probability” data will be described below.

8. All the probabilities obtained above are multiplied together to form a final probability which is compared against a threshold in order to classify the window as “face” or “nonface”. It will be appreciated that the detection result of “face” or “nonface” is a probability-based measure rather than an absolute detection. Sometimes, an image not containing a face may be wrongly detected as “face”, a so-called false positive. At other times, an image containing a face may be wrongly detected as “nonface”, a so-called false negative. It is an aim of any face detection system to reduce the proportion of false positives and the proportion of false negatives, but it is of course understood that to reduce these proportions to zero is difficult, if not impossible, with current technology.

As mentioned above, in the training phase, a set of “nonface” images can be used to generate a corresponding set of “nonface” histograms. Then, to achieve detection of a face, the “probability” produced from the nonface histograms may be compared with a separate threshold, so that the probability has to be under the threshold for the test window to contain a face. Alternatively, the ratio of the face probability to the nonface probability could be compared with a threshold.

Extra training data may be generated by applying “synthetic variations” 330 to the original training set, such as variations in position, orientation, size, aspect ratio, background scenery, lighting intensity and frequency content.

The derivation of attributes and their quantisation will now be described. In the present technique, attributes are measured with respect to so-called eigenblocks, which are core blocks (or eigenvectors) representing different types of block which may be present in the windowed image. The generation of eigenblocks will first be described with reference to FIG. 6.

Eigenblock Creation

The attributes in the present embodiment are based on so-called eigenblocks. The eigenblocks were designed to have good representational ability of the blocks in the training set. Therefore, they were created by performing principal component analysis on a large set of blocks from the training set. This process is shown schematically in FIG. 6 and described in more detail in Appendix B.

Training the System

Experiments were performed with two different sets of training blocks.

Eigenblock Set I

Initially, a set of blocks was used that were taken from 25 face images in the training set. The 16×16 blocks were sampled every 16 pixels and so were non-overlapping. This sampling is shown in FIG. 6. As can be seen, 16 blocks are generated from each 64×64 training image. This leads to a total of 400 training blocks overall.

The first 10 eigenblocks generated from these training blocks are shown in FIG. 7.

Eigenblock Set II

A second set of eigenblocks was generated from a much larger set of training blocks. These blocks were taken from 500 face images in the training set. In this case, the 16×16 blocks were sampled every 8 pixels and so overlapped by 8 pixels. This generated 49 blocks from each 64×64 training image and led to a total of 24,500 training blocks.

The first 12 eigenblocks generated from these training blocks are shown in FIG. 8.

Empirical results show that eigenblock set II gives slightly better results than set I. This is because it is calculated from a larger set of training blocks taken from face images, and so is perceived to be better at representing the variations in faces. However, the improvement in performance is not large.

Building the Histograms

A histogram was built for each sampled block position within the 64×64 face image. The number of histograms depends on the block spacing. For example, for block spacing of 16 pixels, there are 16 possible block positions and thus 16 histograms are used.

The process used to build a histogram representing a single block position is shown in FIG. 9. The histograms are created using a large training set 400 of M face images. For each face image, the process comprises:

-   Extracting 410 the relevant block from a position (i,j) in the face image.
-   Calculating the eigenblock-based attributes for the block, and determining the relevant bin number 420 from these attributes.
-   Incrementing the relevant bin number in the histogram 430.

This process is repeated for each of M images in the training set, to create a histogram that gives a good representation of the distribution of frequency of occurrence of the attributes. Ideally, M is very large, e.g. several thousand. This can more easily be achieved by using a training set made up of a set of original faces and several hundred synthetic variations of each original face.

Generating the Histogram Bin Number

A histogram bin number is generated from a given block using the following process, as shown in FIG. 10. The 16×16 block 440 is extracted from the 64×64 window or face image. The block is projected onto the set 450 of A eigenblocks to generate a set of “eigenblock weights”. These eigenblock weights are the “attributes” used in this implementation. They have a range of −1 to +1. This process is described in more detail in Appendix B. Each weight is quantised into a fixed number of levels, L, to produce a set of quantised attributes 470, $w_i$, i=1..A. The quantised weights are combined into a single value as follows:

$h = w_1 L^{A-1} + w_2 L^{A-2} + w_3 L^{A-3} + \dots + w_{A-1} L^1 + w_A L^0$

where the value generated, h, is the histogram bin number 480. Note that the total number of bins in the histogram is given by $L^A$.
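The following sketch illustrates this bin-number calculation. The projection and normalisation details (described in Appendix B rather than here) are assumptions made purely for illustration, and bin_number is a hypothetical helper name rather than part of the described system.

```python
import numpy as np

def bin_number(block, eigenblocks, levels=8):
    """Project a 16x16 block onto A eigenblocks and combine the quantised weights
    into a single histogram bin number h = w1*L^(A-1) + ... + wA*L^0."""
    v = block.astype(np.float64).ravel()
    v = v / (np.linalg.norm(v) + 1e-12)          # assumed normalisation step
    weights = np.array([float(np.dot(v, e.ravel())) for e in eigenblocks])  # roughly -1..+1
    # quantise each weight into 'levels' equal steps over [-1, +1]
    q = np.clip(((weights + 1.0) / 2.0 * levels).astype(int), 0, levels - 1)
    h = 0
    for w in q:                                  # base-L positional combination
        h = h * levels + int(w)
    return h                                     # 0 <= h < levels ** len(eigenblocks)
```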

The bin “contents”, i.e. the frequency of occurrence of the set of attributes giving rise to that bin number, may be considered to be a probability value if it is divided by the number of training images M. However, because the probabilities are compared with a threshold, there is in fact no need to divide through by M as this value would cancel out in the calculations. So, in the following discussions, the bin “contents” will be referred to as “probability values”, and treated as though they are probability values, even though in a strict sense they are in fact frequencies of occurrence.

The above process is used both in the training phase and in the detection phase.

Face Detection Phase

The face detection process involves sampling the test image with a moving 64×64 window and calculating a face probability at each window position.

The calculation of the face probability is shown in FIG. 11. For each block position in the window, the block's bin number 490 is calculated as described in the previous section. Using the appropriate histogram 500 for the position of the block, each bin number is looked up and the probability 510 of that bin number is determined. The sum 520 of the logs of these probabilities is then calculated across all the blocks to generate a face probability value, $P_{face}$ (otherwise referred to as a log likelihood value).
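A minimal sketch of this per-window calculation is given below. It reuses the hypothetical bin_number helper from above and adds a small floor inside the logarithm to avoid log(0) for bins never seen in training; that floor is an implementation assumption, not part of the described method.

```python
import numpy as np

def window_log_probability(window, histograms, eigenblocks, positions, levels=8):
    """Sum of log 'probabilities' (bin frequencies) over all block positions in a 64x64 window."""
    log_p = 0.0
    for (y, x) in positions:
        h = bin_number(window[y:y + 16, x:x + 16], eigenblocks, levels)
        count = histograms[(y, x)][h]
        log_p += np.log(count + 1e-6)   # small floor avoids log(0) for unseen bins (assumption)
    return log_p                        # P_face, a log likelihood value
```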

This process generates a probability “map” for the entire test image. In other words, a probability value is derived in respect of each possible window centre position across the image. The combination of all of these probability values into a rectangular (or whatever) shaped array is then considered to be a probability “map” corresponding to that image.

This map is then inverted, so that the process of finding a face involves finding minima in the inverted map. A so-called distance-based technique is used. This technique can be summarised as follows: the map (pixel) position with the smallest value in the inverted probability map is chosen. If this value is larger than a threshold (TD), no more faces are chosen. This is the termination criterion. Otherwise a face-sized block corresponding to the chosen centre pixel position is blanked out (i.e. omitted from the following calculations) and the candidate face position finding procedure is repeated on the rest of the image until the termination criterion is reached.
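One way this distance-based search could be coded is sketched below, assuming the inverted probability map is held as a 2-D array and that blanking out a face-sized block can be approximated by setting a square region to infinity; the square approximation and the parameter names are illustrative assumptions only.

```python
import numpy as np

def extract_faces(inverted_map, face_size, t_d):
    """Repeatedly pick the minimum of the inverted probability map, blank out a
    face-sized block around it, and stop once the minimum exceeds threshold t_d."""
    work = inverted_map.astype(np.float64).copy()
    detections = []
    half = face_size // 2
    while True:
        idx = np.unravel_index(np.argmin(work), work.shape)
        if work[idx] > t_d:            # termination criterion
            break
        detections.append(idx)
        y, x = idx
        work[max(0, y - half):y + half + 1, max(0, x - half):x + half + 1] = np.inf  # blank out
    return detections
```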

Nonface Method

The nonface model comprises an additional set of histograms which represent the probability distribution of attributes in nonface images. The histograms are created in exactly the same way as for the face model, except that the training images contain examples of nonfaces instead of faces.

During detection, two log probability values are computed, one using the face model and one using the nonface model. These are then combined by simply subtracting the nonface probability from the face probability:

$P_{combined} = P_{face} - P_{nonface}$

$P_{combined}$ is then used instead of $P_{face}$ to produce the probability map (before inversion).

Note that the reason that $P_{nonface}$ is subtracted from $P_{face}$ is because these are log probability values.

Histogram Examples

FIGS. 12 a to 12 f show some examples of histograms generated by the training process described above.

FIGS. 12 a, 12 b and 12 c are derived from a training set of face images, and FIGS. 12 d, 12 e and 12 f are derived from a training set of nonface images. In particular, FIGS. 12 a and 12 d show the whole histograms, FIGS. 12 b and 12 e are zoomed onto the main peaks at about h = 1500, and FIGS. 12 c and 12 f are a further zoom onto the region about h = 1570, for the face and nonface histograms respectively.

It can clearly be seen that the peaks are in different places in the face histograms and the nonface histograms.

Multiscale Face Detection

In order to detect faces of different sizes in the test image, the test image is scaled by a range of factors and a distance (i.e. probability) map is produced for each scale. In FIGS. 13 a to 13 c the images and their corresponding distance maps are shown at three different scales. The method gives the best response (highest probability, or minimum distance) for the large (central) subject at the smallest scale (FIG. 13 a) and better responses for the smaller subject (to the left of the main figure) at the larger scales. (A darker colour on the map represents a lower value in the inverted map, or in other words a higher probability of there being a face.)

Candidate face positions are extracted across different scales by first finding the position which gives the best response over all scales. That is to say, the highest probability (lowest distance) is established amongst all of the probability maps at all of the scales. This candidate position is the first to be labelled as a face. The window centred over that face position is then blanked out from the probability map at each scale. The size of the window blanked out is proportional to the scale of the probability map.

Examples of this scaled blanking-out process are shown in FIGS. 13 a to 13 c. In particular, the highest probability across all the maps is found at the left hand side of the largest scale map (FIG. 13 c). An area 530 corresponding to the presumed size of a face is blanked off in FIG. 13 c. Corresponding, but scaled, areas 532, 534 are blanked off in the smaller maps.

Areas larger than the test window may be blanked off in the maps, to avoid overlapping detections. In particular, an area equal to the size of the test window surrounded by a border half as wide/long as the test window is appropriate to avoid such overlapping detections.

Additional faces are detected by searching for the next best response and blanking out the corresponding windows successively.

The intervals allowed between the scales processed are influenced by the sensitivity of the method to variations in size. It was found in this preliminary study of scale invariance that the method is not excessively sensitive to variations in size, as faces which gave a good response at a certain scale often gave a good response at adjacent scales as well.

The above description refers to detecting a face even though the size of the face in the image is not known at the start of the detection process. Another aspect of multiple scale face detection is the use of two or more parallel detections at different scales to validate the detection process. This can have advantages if, for example, the face to be detected is partially obscured, or the person is wearing a hat etc.

FIGS. 13 d to 13 g schematically illustrate this process. During the training phase, the system is trained on windows (divided into respective blocks as described above) which surround the whole of the test face (FIG. 13 d) to generate “full face” histogram data, and also on windows at an expanded scale so that only a central area of the test face is included (FIG. 13 e) to generate “zoomed in” histogram data. This generates two sets of histogram data. One set relates to the “full face” windows of FIG. 13 d, and the other relates to the “central face area” windows of FIG. 13 e.

During the detection phase, for any given test window 536, the window is applied to two different scalings of the test image so that in one (FIG. 13 f) the test window surrounds the whole of the expected size of a face, and in the other (FIG. 13 g) the test window encompasses the central area of a face at that expected size. These are each processed as described above, being compared with the respective sets of histogram data appropriate to the type of window. The log probabilities from each parallel process are added before the comparison with a threshold is applied.

Putting both of these aspects of multiple scale face detection together leads to a particularly elegant saving in the amount of data that needs to be stored.

In particular, in these embodiments the multiple scales for the arrangements of FIGS. 13 a to 13 c are arranged in a geometric sequence. In the present example, each scale in the sequence is a factor of $\sqrt[4]{2}$ different to the adjacent scale in the sequence. Then, for the parallel detection described with reference to FIGS. 13 d to 13 g, the larger scale, central area, detection is carried out at a scale 3 steps higher in the sequence, that is, $2^{3/4}$ times larger than the “full face” scale, using attribute data relating to the scale 3 steps higher in the sequence. So, apart from at extremes of the range of multiple scales, the geometric progression means that the parallel detection of FIGS. 13 d to 13 g can always be carried out using attribute data generated in respect of another multiple scale three steps higher in the sequence.
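Expressed numerically, the scale sequence and the three-step pairing might look like the following sketch; the number of scales and the base value are arbitrary illustrative choices.

```python
# Scales form a geometric sequence with ratio 2**0.25; the "zoomed in" (central area)
# detection for a given scale reuses attribute data from 3 steps higher, i.e. 2**0.75 larger.
def scale_sequence(n_scales, base=1.0, ratio=2 ** 0.25):
    return [base * ratio ** i for i in range(n_scales)]

scales = scale_sequence(12)
# full-face detection at scales[i] pairs with zoomed-in detection at scales[i + 3]
```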

The two processes (multiple scale detection and parallel scale detection) can be combined in various ways. For example, the multiple scale detection process of FIGS. 13 a to 13 c can be applied first, and then the parallel scale detection process of FIGS. 13 d to 13 g can be applied at areas (and scales) identified during the multiple scale detection process. However, a convenient and efficient use of the attribute data may be achieved by:

-   deriving attributes in respect of the test window at each scale (as in FIGS. 13 a to 13 c)
-   comparing those attributes with the “full face” histogram data to generate a “full face” set of distance maps
-   comparing the attributes with the “zoomed in” histogram data to generate a “zoomed in” set of distance maps
-   for each scale n, combining the “full face” distance map for scale n with the “zoomed in” distance map for scale n+3 (a short sketch of this combining step follows this list)
-   deriving face positions from the combined distance maps as described above with reference to FIGS. 13 a to 13 c
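Assuming the two sets of distance maps have already been brought onto a common grid of window centre positions (an alignment detail not spelled out here), the combining step could be sketched as follows.

```python
def combine_distance_maps(full_face_maps, zoomed_in_maps):
    """Combine the 'full face' map at scale n with the 'zoomed in' map at scale n+3.
    Both inputs are lists of log-probability (distance) arrays, one per scale,
    assumed here to be resampled onto the same grid of window centre positions."""
    combined = []
    for n in range(len(full_face_maps) - 3):
        combined.append(full_face_maps[n] + zoomed_in_maps[n + 3])  # log probabilities add
    return combined   # face positions are then extracted as in FIGS. 13a to 13c
```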

Further parallel testing can be performed to detect different poses, such as looking straight ahead, looking partly up, down, left, right etc. Here a respective set of histogram data is required and the results are preferably combined using a “max” function, that is, the pose giving the highest probability is carried forward to thresholding, the others being discarded.

Face Tracking

A face tracking algorithm will now be described. The tracking algorithm aims to improve face detection performance in image sequences.

The initial aim of the tracking algorithm is to detect every face in every frame of an image sequence. However, it is recognised that sometimes a face in the sequence may not be detected. In these circumstances, the tracking algorithm may assist in interpolating across the missing face detections.

Ultimately, the goal of face tracking is to be able to output some useful metadata from each set of frames belonging to the same scene in an image sequence. This might include:

-   Number of faces.
-   “Mugshot” (a colloquial word for an image of a person's face, derived from a term referring to a police file photograph) of each face.
-   Frame number at which each face first appears.
-   Frame number at which each face last appears.
-   Identity of each face (either matched to faces seen in previous scenes, or matched to a face database)—this requires some face recognition also.

The tracking algorithm uses the results of the face detection algorithm, run independently on each frame of the image sequence, as its starting point. Because the face detection algorithm may sometimes miss (not detect) faces, some method of interpolating the missing faces is useful. To this end, a Kalman filter was used to predict the next position of the face and a skin colour matching algorithm was used to aid tracking of faces. In addition, because the face detection algorithm often gives rise to false acceptances, some method of rejecting these is also useful.

The algorithm is shown schematically in FIG. 14.

The algorithm will be described in detail below, but in summary, input video data 545 (representing the image sequence) is supplied to a face detector of the type described in this application, and a skin colour matching detector 550. The face detector attempts to detect one or more faces in each image. When a face is detected, a Kalman filter 560 is established to track the position of that face. The Kalman filter generates a predicted position for the same face in the next image in the sequence. An eye position comparator 570, 580 detects whether the face detector 540 detects a face at that position (or within a certain threshold distance of that position) in the next image. If this is found to be the case, then that detected face position is used to update the Kalman filter and the process continues.

If a face is not detected at or near the predicted position, then a skin colour matching method 550 is used. This is a less precise face detection technique which is set up to have a lower threshold of acceptance than the face detector 540, so that it is possible for the skin colour matching technique to detect (what it considers to be) a face even when the face detector cannot make a positive detection at that position. If a “face” is detected by skin colour matching, its position is passed to the Kalman filter as an updated position and the process continues.

If no match is found by either the face detector 540 or the skin colour detector 550, then the predicted position is used to update the Kalman filter.

All of these results are subject to acceptance criteria (see below). So, for example, a face that is tracked throughout a sequence on the basis of one positive detection and the remainder as predictions, or the remainder as skin colour detections, will be rejected.

A separate Kalman filter is used to track each face in the tracking algorithm.

In order to use a Kalman filter to track a face, a state model representing the face must be created. In the model, the position of each face is represented by a 4-dimensional vector containing the co-ordinates of the left and right eyes, which in turn are derived by a predetermined relationship to the centre position of the window and the scale being used:

$p(k) = \begin{bmatrix} \mathit{FirstEyeX} \\ \mathit{FirstEyeY} \\ \mathit{SecondEyeX} \\ \mathit{SecondEyeY} \end{bmatrix}$

where k is the frame number.

The current state of the face is represented by its position, velocity and acceleration, in a 12-dimensional vector:

$\hat{z}(k) = \begin{bmatrix} p(k) \\ \dot{p}(k) \\ \ddot{p}(k) \end{bmatrix}$

First Face Detected

The tracking algorithm does nothing until it receives a frame with a face detection result indicating that there is a face present.

A Kalman filter is then initialised for each detected face in this frame. Its state is initialised with the position of the face, and with zero velocity and acceleration:

$\hat{z}_a(k) = \begin{bmatrix} p(k) \\ 0 \\ 0 \end{bmatrix}$

It is also assigned some other attributes: the state model error covariance, Q, and the observation error covariance, R. The error covariance of the Kalman filter, P, is also initialised. These parameters are described in more detail below. At the beginning of the following frame, and every subsequent frame, a Kalman filter prediction process is carried out.

Kalman Filter Prediction Process

For each existing Kalman filter, the next position of the face is predicted using the standard Kalman filter prediction equations shown below. The filter uses the previous state (at frame k−1) and some other internal and external variables to estimate the current state of the filter (at frame k).

$\hat{z}_b(k) = \Phi(k,k-1)\,\hat{z}_a(k-1)$   (state prediction equation)

$P_b(k) = \Phi(k,k-1)\,P_a(k-1)\,\Phi(k,k-1)^T + Q(k)$   (covariance prediction equation)

where $\hat{z}_b(k)$ denotes the state before updating the filter for frame k, $\hat{z}_a(k-1)$ denotes the state after updating the filter for frame k−1 (or the initialised state if it is a new filter), and $\Phi(k,k-1)$ is the state transition matrix. Various state transition matrices were experimented with, as described below. Similarly, $P_b(k)$ denotes the filter's error covariance before updating the filter for frame k and $P_a(k-1)$ denotes the filter's error covariance after updating the filter for the previous frame (or the initialised value if it is a new filter). $P_b(k)$ can be thought of as an internal variable in the filter that models its accuracy.

$Q(k)$ is the error covariance of the state model. A high value of $Q(k)$ means that the predicted values of the filter's state (i.e. the face's position) will be assumed to have a high level of error. By tuning this parameter, the behaviour of the filter can be changed and potentially improved for face detection.

State Transition Matrix

The state transition matrix, $\Phi(k,k-1)$, determines how the prediction of the next state is made. Using the equations for motion, the following matrix can be derived for $\Phi(k,k-1)$:

$\Phi(k,k-1) = \begin{bmatrix} I_4 & I_4\,\Delta t & \frac{1}{2} I_4 (\Delta t)^2 \\ O_4 & I_4 & I_4\,\Delta t \\ O_4 & O_4 & I_4 \end{bmatrix}$

where $O_4$ is a 4×4 zero matrix and $I_4$ is a 4×4 identity matrix. $\Delta t$ can simply be set to 1 (i.e. units of t are frame periods).

This state transition matrix models position, velocity and acceleration. However, it was found that the use of acceleration tended to make the face predictions accelerate towards the edge of the picture when no face detections were available to correct the predicted state. Therefore, a simpler state transition matrix without using acceleration was preferred:

$\Phi(k,k-1) = \begin{bmatrix} I_4 & I_4\,\Delta t & O_4 \\ O_4 & I_4 & O_4 \\ O_4 & O_4 & O_4 \end{bmatrix}$
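A sketch of the prediction step using this simplified transition matrix is shown below; the layout of the 12-element state vector follows the definitions above, while the function and variable names are illustrative only.

```python
import numpy as np

I4, O4 = np.eye(4), np.zeros((4, 4))
dt = 1.0   # units of t are frame periods

# Simplified state transition matrix (position + velocity, no acceleration)
PHI = np.block([[I4, I4 * dt, O4],
                [O4, I4,      O4],
                [O4, O4,      O4]])

def kalman_predict(z_a_prev, P_a_prev, Q):
    """Standard Kalman prediction: z_b(k) = PHI z_a(k-1); P_b(k) = PHI P_a(k-1) PHI^T + Q."""
    z_b = PHI @ z_a_prev
    P_b = PHI @ P_a_prev @ PHI.T + Q
    return z_b, P_b
```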

The predicted eye positions of each Kalman filter, $\hat{z}_b(k)$, are compared to all face detection results in the current frame (if there are any). If the distance between the eye positions is below a given threshold, then the face detection can be assumed to belong to the same face as that being modelled by the Kalman filter. The face detection result is then treated as an observation, y(k), of the face's current state:

$y(k) = \begin{bmatrix} p(k) \\ 0 \\ 0 \end{bmatrix}$

where p(k) is the position of the eyes in the face detection result. This observation is used during the Kalman filter update stage to help correct the prediction.

Skin Colour Matching

Skin colour matching is not used for faces that successfully match face detection results. Skin colour matching is only performed for faces whose position has been predicted by the Kalman filter but have no matching face detection result in the current frame, and therefore no observation data to help update the Kalman filter.

In a first technique, for each face, an elliptical area centred on the face's previous position is extracted from the previous frame. An example of such an area 600 within the face window 610 is shown schematically in FIG. 16. A colour model is seeded using the chrominance data from this area to produce an estimate of the mean and covariance of the Cr and Cb values, based on a Gaussian model.

An area around the predicted face position in the current frame is then searched and the position that best matches the colour model, again averaged over an elliptical area, is selected. If the colour match meets a given similarity criterion, then this position is used as an observation, y(k), of the face's current state in the same way described for face detection results in the previous section.

FIGS. 15 a and 15 b schematically illustrate the generation of the search area. In particular, FIG. 15 a schematically illustrates the predicted position 620 of a face within the next image 630. In skin colour matching, a search area 640 surrounding the predicted position 620 in the next image is searched for the face.

If the colour match does not meet the similarity criterion, then no reliable observation data is available for the current frame. Instead, the predicted state, $\hat{z}_b(k)$, is used as the observation:

$y(k) = \hat{z}_b(k)$

The skin colour matching methods described above use a simple Gaussian skin colour model. The model is seeded on an elliptical area centred on the face in the previous frame, and used to find the best matching elliptical area in the current frame. However, to provide a potentially better performance, two further methods were developed: a colour histogram method and a colour mask method. These will now be described.

Colour Histogram Method

In this method, instead of using a Gaussian to model the distribution of colour in the tracked face, a colour histogram is used.

For each tracked face in the previous frame, a histogram of Cr and Cb values within a square window around the face is computed. To do this, for each pixel the Cr and Cb values are first combined into a single value. A histogram is then computed that measures the frequency of occurrence of these values in the whole window. Because the number of combined Cr and Cb values is large (256×256 possible combinations), the values are quantised before the histogram is calculated.

Having calculated a histogram for a tracked face in the previous frame, the histogram is used in the current frame to try to estimate the most likely new position of the face by finding the area of the image with the most similar colour distribution. As shown schematically in FIGS. 15 a and 15 b, this is done by calculating a histogram in exactly the same way for a range of window positions within a search area of the current frame. This search area covers a given area around the predicted face position. The histograms are then compared by calculating the mean squared error (MSE) between the original histogram for the tracked face in the previous frame and each histogram in the current frame. The estimated position of the face in the current frame is given by the position of the minimum MSE.
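A compact sketch of this search is given below. It assumes 8-bit Cr/Cb planes, a square face window and simple uniform quantisation; the helper names and the way the search positions are enumerated are illustrative assumptions.

```python
import numpy as np

def cbcr_histogram(window_cr, window_cb, levels=8):
    """Quantise Cr/Cb to 'levels' each and histogram the combined value over the window."""
    cr_q = (window_cr.astype(int) * levels) // 256
    cb_q = (window_cb.astype(int) * levels) // 256
    combined = cr_q * levels + cb_q
    return np.bincount(combined.ravel(), minlength=levels * levels).astype(np.float64)

def best_colour_match(ref_hist, cr, cb, search_positions, win, levels=8):
    """Return the window position in the search area whose histogram has minimum MSE
    against the reference histogram taken from the previous frame."""
    best_pos, best_mse = None, np.inf
    for (y, x) in search_positions:   # positions assumed to keep the window inside the image
        h = cbcr_histogram(cr[y:y + win, x:x + win], cb[y:y + win, x:x + win], levels)
        mse = float(np.mean((h - ref_hist) ** 2))
        if mse < best_mse:
            best_pos, best_mse = (y, x), mse
    return best_pos
```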

Various modifications may be made to this algorithm, including:

-   Using three channels (Y, Cr and Cb) instead of two (Cr, Cb).
-   Varying the number of quantisation levels.
-   Dividing the window into blocks and calculating a histogram for each block. In this way, the colour histogram method becomes positionally dependent. The MSE between each pair of histograms is summed in this method.
-   Varying the number of blocks into which the window is divided.
-   Varying the blocks that are actually used—e.g. omitting the outer blocks which might only partially contain face pixels.

For the test data used in empirical trials of these techniques, the best results were achieved using the following conditions, although other sets of conditions may provide equally good or better results with different test data:

-   3 channels (Y, Cr and Cb).
-   8 quantisation levels for each channel (i.e. histogram contains 8×8×8=512 bins).
-   Dividing the windows into 16 blocks.
-   Using all 16 blocks.

Colour Mask Method

This method is based on the method first described above. It uses a Gaussian skin colour model to describe the distribution of pixels in the face.

In the method first described above, an elliptical area centred on the face is used to colour match faces, as this may be perceived to reduce or minimise the quantity of background pixels which might degrade the model.

In the present colour mask model, a similar elliptical area is still used to seed a colour model on the original tracked face in the previous frame, for example by applying the mean and covariance of RGB or YCrCb to set parameters of a Gaussian model (or alternatively, a default colour model such as a Gaussian model can be used, see below). However, it is not used when searching for the best match in the current frame. Instead, a mask area is calculated based on the distribution of pixels in the original face window from the previous frame. The mask is calculated by finding the 50% of pixels in the window which best match the colour model. An example is shown in FIGS. 17 a to 17 c. In particular, FIG. 17 a schematically illustrates the initial window under test; FIG. 17 b schematically illustrates the elliptical window used to seed the colour model; and FIG. 17 c schematically illustrates the mask defined by the 50% of pixels which most closely match the colour model.

To estimate the position of the face in the current frame, a search area around the predicted face position is searched (as before) and the “distance” from the colour model is calculated for each pixel. The “distance” refers to a difference from the mean, normalised in each dimension by the variance in that dimension. An example of the resultant distance image is shown in FIG. 18. For each position in this distance map (or for a reduced set of sampled positions to reduce computation time), the pixels of the distance image are averaged over a mask-shaped area. The position with the lowest averaged distance is then selected as the best estimate for the position of the face in this frame.
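The mask construction and the mask-shaped averaging could be sketched as follows, assuming the per-pixel distance images for the previous and current frames have already been computed; the 50% threshold is taken from the description above, and the function names are hypothetical.

```python
import numpy as np

def colour_mask(prev_distance, keep_fraction=0.5):
    """Mask = the 50% of pixels in the previous face window that best match the colour model,
    given the per-pixel distance from the model over that window."""
    return prev_distance <= np.quantile(prev_distance, keep_fraction)

def best_mask_match(dist_image, mask, positions):
    """Average the current frame's distance image over a mask-shaped area at each candidate
    position in the search area, returning the position with the lowest average distance."""
    h, w = mask.shape
    best_pos, best_avg = None, np.inf
    for (y, x) in positions:          # positions assumed to keep the window inside the image
        patch = dist_image[y:y + h, x:x + w]
        avg = float(patch[mask].mean())
        if avg < best_avg:
            best_pos, best_avg = (y, x), avg
    return best_pos
```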

This method thus differs from the original method in that a mask-shaped area is used in the distance image, instead of an elliptical area. This allows the colour match method to use both colour and shape information.

Two variations are proposed and were implemented in empirical trials of the techniques:

(a) Gaussian skin colour model is seeded using the mean and covariance of Cr and Cb from an elliptical area centred on the tracked face in the previous frame.

(b) A default Gaussian skin colour model is used, both to calculate the mask in the previous frame and calculate the distance image in the current frame.

The use of Gaussian skin colour models will now be described further. A Gaussian model for the skin colour class is built using the chrominance components of the YCbCr colour space. The similarity of test pixels to the skin colour class can then be measured. This method thus provides a skin colour likelihood estimate for each pixel, independently of the eigenface-based approaches.

Let w be the vector of the CbCr values of a test pixel. The probability of w belonging to the skin colour class S is modelled by a two-dimensional Gaussian:

$p(w \mid S) = \frac{\exp\left[-\frac{1}{2}(w-\mu_s)^T \Sigma_s^{-1} (w-\mu_s)\right]}{2\pi \left|\Sigma_s\right|^{\frac{1}{2}}}$

where the mean $\mu_s$ and the covariance matrix $\Sigma_s$ of the distribution are (previously) estimated from a training set of skin colour values.
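A direct, vectorised rendering of this model is sketched below; fitting the parameters from labelled skin pixels and evaluating the likelihood per pixel are both shown, with the function names being illustrative only.

```python
import numpy as np

def fit_skin_model(cb_samples, cr_samples):
    """Estimate mean and covariance of CbCr values from a training set of skin pixels."""
    data = np.stack([cb_samples.ravel(), cr_samples.ravel()], axis=1).astype(np.float64)
    return data.mean(axis=0), np.cov(data, rowvar=False)

def skin_likelihood(cb, cr, mu, cov):
    """p(w|S) for each pixel, with w = (Cb, Cr), under the two-dimensional Gaussian above."""
    w = np.stack([cb, cr], axis=-1).astype(np.float64) - mu
    inv = np.linalg.inv(cov)
    det = np.linalg.det(cov)
    md2 = np.einsum('...i,ij,...j->...', w, inv, w)       # squared Mahalanobis distance
    return np.exp(-0.5 * md2) / (2.0 * np.pi * np.sqrt(det))
```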

Skin colour detection is not considered to be an effective face detector when used on its own. This is because there can be many areas of an image that are similar to skin colour but are not necessarily faces, for example other parts of the body. However, it can be used to improve the performance of the eigenblock-based approaches by using a combined approach as described in respect of the present face tracking system. The decisions made on whether to accept the face detected eye positions or the colour matched eye positions as the observation for the Kalman filter, or whether no observation was accepted, are stored. These are used later to assess the ongoing validity of the faces modelled by each Kalman filter.

Kalman Filter Update Step

The update step is used to determine an appropriate output of the filter for the current frame, based on the state prediction and the observation data. It also updates the internal variables of the filter based on the error between the predicted state and the observed state.

The following equations are used in the update step:

$K(k) = P_b(k) H^T(k) \left( H(k) P_b(k) H^T(k) + R(k) \right)^{-1}$   (Kalman gain equation)

$\hat{z}_a(k) = \hat{z}_b(k) + K(k) \left[ y(k) - H(k) \hat{z}_b(k) \right]$   (state update equation)

$P_a(k) = P_b(k) - K(k) H(k) P_b(k)$   (covariance update equation)

Here, K(k) denotes the Kalman gain, another variable internal to the Kalman filter. It is used to determine how much the predicted state should be adjusted based on the observed state, y(k).

H(k) is the observation matrix. It determines which parts of the state can be observed. In our case, only the position of the face can be observed, not its velocity or acceleration, so the following matrix is used for H(k):

$H(k) = \begin{bmatrix} I_4 & O_4 & O_4 \\ O_4 & O_4 & O_4 \\ O_4 & O_4 & O_4 \end{bmatrix}$

R(k) is the error covariance of the observation data. In a similar way to Q(k), a high value of R(k) means that the observed values of the filter's state (i.e. the face detection results or colour matches) will be assumed to have a high level of error. By tuning this parameter, the behaviour of the filter can be changed and potentially improved for face detection. For our experiments, a large value of R(k) relative to Q(k) was found to be suitable (this means that the predicted face positions are treated as more reliable than the observations). Note that it is permissible to vary these parameters from frame to frame. Therefore, an interesting future area of investigation may be to adjust the relative values of R(k) and Q(k) depending on whether the observation is based on a face detection result (reliable) or a colour match (less reliable).
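These update equations could be coded directly as below; the pseudo-inverse is an illustrative safeguard for the case where the bracketed term is not of full rank, and is not something specified by the description above.

```python
import numpy as np

I4, O4 = np.eye(4), np.zeros((4, 4))
H = np.block([[I4, O4, O4],
              [O4, O4, O4],
              [O4, O4, O4]])   # only the eye positions are observed

def kalman_update(z_b, P_b, y, R):
    """Kalman gain, state update and covariance update as given above."""
    S = H @ P_b @ H.T + R
    K = P_b @ H.T @ np.linalg.pinv(S)   # pseudo-inverse as a robustness assumption
    z_a = z_b + K @ (y - H @ z_b)
    P_a = P_b - K @ H @ P_b
    return z_a, P_a
```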

For each Kalman filter, the updated state, $\hat{z}_a(k)$, is used as the final decision on the position of the face. This data is output to file and stored.

Unmatched face detection results are treated as new faces. A new Kalman filter is initialised for each of these. Faces are removed which:

-   Leave the edge of the picture and/or
-   Have a lack of ongoing evidence supporting them (when there is a high proportion of observations based on Kalman filter predictions rather than face detection results or colour matches).

For these faces, the associated Kalman filter is removed and no data is output to file. As an optional difference from this approach, where a face is detected to leave the picture, the tracking results up to the frame before it leaves the picture may be stored and treated as valid face tracking results (providing that the results meet any other criteria applied to validate tracking results).

These rules may be formalised and built upon by bringing in some additional variables (a sketch applying these rules in software follows the list below):

-   prediction_acceptance_ratio_threshold: If, during tracking of a given face, the proportion of accepted Kalman predicted face positions exceeds this threshold, then the tracked face is rejected. This is currently set to 0.8.
-   detection_acceptance_ratio_threshold: During a final pass through all the frames, if for a given face the proportion of accepted face detections falls below this threshold, then the tracked face is rejected. This is currently set to 0.08.
-   min_frames: During a final pass through all the frames, if for a given face the number of occurrences is less than min_frames, the face is rejected. This is only likely to occur near the end of a sequence. min_frames is currently set to 5.
-   final_prediction_acceptance_ratio_threshold and min_frames2: During a final pass through all the frames, if for a given tracked face the number of occurrences is less than min_frames2 AND the proportion of accepted Kalman predicted face positions exceeds the final_prediction_acceptance_ratio_threshold, the face is rejected. Again, this is only likely to occur near the end of a sequence. final_prediction_acceptance_ratio_threshold is currently set to 0.5 and min_frames2 is currently set to 10.
-   min_eye_spacing: Additionally, faces are now removed if they are tracked such that the eye spacing is decreased below a given minimum distance. This can happen if the Kalman filter falsely believes the eye distance is becoming smaller and there is no other evidence, e.g. face detection results, to correct this assumption. If uncorrected, the eye distance would eventually become zero. As an optional alternative, a minimum or lower limit eye separation can be forced, so that if the detected eye separation reduces to the minimum eye separation, the detection process continues to search for faces having that eye separation, but not a smaller eye separation.
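Purely as an illustration of how these thresholds interact, the following sketch applies them to a completed track. The per-frame record structure, the default min_eye_spacing value and the single-pass layout are assumptions; the description above applies some of the tests only during a final pass over all frames.

```python
def reject_track(track,
                 prediction_acceptance_ratio_threshold=0.8,
                 detection_acceptance_ratio_threshold=0.08,
                 min_frames=5,
                 final_prediction_acceptance_ratio_threshold=0.5,
                 min_frames2=10,
                 min_eye_spacing=4.0):
    """Return True if a completed track should be rejected under the rules above.
    'track' is assumed to be a list of per-frame records, each with an 'accept_type'
    ('D' detection, 'S' skin colour, 'P' prediction) and an 'eye_spacing' value."""
    n = len(track)
    if n == 0:
        return True
    pred_ratio = sum(r['accept_type'] == 'P' for r in track) / n
    det_ratio = sum(r['accept_type'] == 'D' for r in track) / n
    if pred_ratio > prediction_acceptance_ratio_threshold:
        return True
    if det_ratio < detection_acceptance_ratio_threshold:
        return True
    if n < min_frames:
        return True
    if n < min_frames2 and pred_ratio > final_prediction_acceptance_ratio_threshold:
        return True
    if any(r['eye_spacing'] < min_eye_spacing for r in track):
        return True
    return False
```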

It is noted that the tracking process is not limited to tracking through a video sequence in a forward temporal direction. Assuming that the image data remain accessible (i.e. the process is not real-time, or the image data are buffered for temporary continued use), the entire tracking process could be carried out in a reverse temporal direction. Or, when a first face detection is made (often part-way through a video sequence) the tracking process could be initiated in both temporal directions. As a further option, the tracking process could be run in both temporal directions through a video sequence, with the results being combined so that (for example) a tracked face meeting the acceptance criteria is included as a valid result whichever direction the tracking took place.

Overlap Rules for Face Tracking

When the faces are tracked, it is possible for the face tracks to become overlapped.

When this happens, in at least some applications, one of the tracks should be deleted. A set of rules is used to determine which face track should persist in the event of an overlap.

Whilst the faces are being tracked there are 3 possible types of track:

-   D: Face Detection—the current position of the face is confirmed by a new face detection
-   S: Skin colour track—there is no face detection, but a suitable skin colour track has been found
-   P: Prediction—there is neither a suitable face detection nor skin colour track, so the predicted face position from the Kalman filter is used.

The following grid defines a priority order if two face tracks overlap with each other:

  Face 1 \ Face 2    D                   S                   P
  D                  Largest face size   D                   D
  S                  D                   Largest face size   S
  P                  D                   S                   Largest face size

So, if both tracks are of the same type, then the largest face size determines which track is to be maintained. Otherwise, detected tracks have priority over skin colour or predicted tracks. Skin colour tracks have priority over predicted tracks.
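A minimal sketch of this priority rule, assuming each track reports its current acceptance type ('D', 'S' or 'P') and a face size, might be:

```python
def track_to_keep(type1, size1, type2, size2):
    """Decide which of two overlapping tracks persists: detections beat skin colour,
    skin colour beats prediction, and equal types are resolved by the larger face size."""
    priority = {'D': 2, 'S': 1, 'P': 0}
    if priority[type1] != priority[type2]:
        return 1 if priority[type1] > priority[type2] else 2
    return 1 if size1 >= size2 else 2
```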

In the tracking method described above, a face track is started for every face detection that cannot be matched up with an existing track. This could lead to many false detections being erroneously tracked and persisting for several frames before finally being rejected by one of the existing rules (e.g. by the rule associated with the prediction_acceptance_ratio_threshold).

Also, the existing rules for rejecting a track (e.g. those rules relating to the variables prediction_acceptance_ratio_threshold and detection_acceptance_ratio_threshold) are biased against tracking someone who turns their head to the side for a significant length of time. In reality, it is often desirable to carry on tracking someone who does this.

A solution will now be described.

The first part of the solution helps to prevent false detections from setting off erroneous tracks. A face track is still started internally for every face detection that does not match an existing track. However, it is not output from the algorithm. In order for this track to be maintained, the first f frames in the track must be face detections (i.e. of type D). If all of the first f frames are of type D then the track is maintained and face locations are output from the algorithm from frame f onwards.

If all of the first f frames are not of type D, then the face track is terminated and no face locations are output for this track.

f is typically set to 2, 3 or 5.

The second part of the solution allows faces in profile to be trackedfor a long period, rather than having their tracks terminated due to alow detection_acceptance_ratio. To achieve this, where the faces arematched by the ±30° eigenblocks, the tests relating to the variablesprediction_acceptance_ratio_threshold anddetection_acceptance_ratio_threshold are not used. Instead, an option isto include the following criterion to maintain a face track:

g consecutive face detections are required every n frames to maintain the face track

where g is typically set to a similar value to f, e.g. 1-5 frames, and n corresponds to the maximum number of frames for which we wish to be able to track someone when they are turned away from the camera, e.g. 10 seconds (=250 or 300 frames depending on frame rate).

This may also be combined with the prediction_acceptance_ratio_threshold and detection_acceptance_ratio_threshold rules. Alternatively, the prediction_acceptance_ratio_threshold and detection_acceptance_ratio_threshold may be applied on a rolling basis, e.g. over only the last 30 frames, rather than since the beginning of the track.

Another criterion for rejecting a face track is that a so-called "bad colour threshold" is exceeded. In this test a tracked face position is validated by skin colour (whatever the acceptance type—face detection or Kalman prediction). Any face whose distance from an expected skin colour exceeds a given "bad colour threshold" has its track terminated.

In the method described above, the skin colour of the face is only checked during skin colour tracking. This means that non-skin-coloured false detections may be tracked, or the face track may wander off into non-skin-coloured locations by using the predicted face position.

To improve on this, whatever the acceptance type of the face (detection, skin colour or Kalman prediction), its skin colour is checked. If its distance (difference) from skin colour exceeds a bad_colour_threshold, then the face track is terminated.

An efficient way to implement this is to use the distance from skin colour of each pixel calculated during skin colour tracking. If this measure, averaged over the face area (either over a mask-shaped area, over an elliptical area or over the whole face window, depending on which skin colour tracking method is being used), exceeds a fixed threshold, then the face track is terminated.
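A hedged sketch of this check follows; the per-pixel skin-colour distance map and the (x, y, width, height) face box format are assumptions, and the elliptical mask is just one of the face-area options mentioned above.

    import numpy as np

    # Terminate a track when the mean distance from skin colour over the face
    # area exceeds a fixed bad_colour_threshold.
    def fails_bad_colour_test(distance_map, face_box, bad_colour_threshold):
        x, y, w, h = face_box
        window = distance_map[y:y + h, x:x + w]
        yy, xx = np.mgrid[0:h, 0:w]
        ellipse = (((xx - w / 2) / (w / 2)) ** 2 +
                   ((yy - h / 2) / (h / 2)) ** 2) <= 1.0
        return window[ellipse].mean() > bad_colour_threshold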

A further criterion for rejecting a face track is that its variance is very low or very high. This technique will be described below after the description of FIGS. 22 a to 22 c.

In the tracking system shown schematically in FIG. 14, three further features are included.

Shot boundary data 560 (from metadata associated with the image sequence under test, or metadata generated within the camera of FIG. 2) defines the limits of each contiguous "shot" within the image sequence. The Kalman filter is reset at shot boundaries, and is not allowed to carry a prediction over to a subsequent shot, as the prediction would be meaningless.

User metadata 542 and camera setting metadata 544 are supplied as inputs to the face detector 540. These may also be used in a non-tracking system. Examples of the camera setting metadata were described above. User metadata may include information such as:

-   type of programme (e.g. news, interview, drama)
-   script information such as specification of a "long shot", "medium close-up" etc (particular types of camera shot leading to an expected sub-range of face sizes), how many people are involved in each shot (again leading to an expected sub-range of face sizes) and so on
-   sports-related information—sports are often filmed from fixed camera positions using standard views and shots. By specifying these in the metadata, again a sub-range of face sizes can be derived

The type of programme is relevant to the type of face which may be expected in the images or image sequence. For example, in a news programme, one would expect to see a single face for much of the image sequence, occupying an area of (say) 10% of the screen. The detection of faces at different scales can be weighted in response to this data, so that faces of about this size are given an enhanced probability. Another alternative or additional approach is that the search range is reduced, so that instead of searching for faces at all possible scales, only a subset of scales is searched. This can reduce the processing requirements of the face detection process. In a software-based system, the software can run more quickly and/or on a less powerful processor. In a hardware-based system (including for example an application-specific integrated circuit (ASIC) or field programmable gate array (FPGA) system) the hardware needs may be reduced.

The other types of user metadata mentioned above may also be applied in this way. The "expected face size" sub-ranges may be stored in a look-up table held in the memory 30, for example.

As regards camera metadata, for example the current focus and zoom settings of the lens 110, these can also assist the face detector by giving an initial indication of the expected image size of any faces that may be present in the foreground of the image. In this regard, it is noted that the focus and zoom settings between them define the expected separation between the camcorder 100 and a person being filmed, and also the magnification of the lens 110. From these two attributes, based upon an average face size, it is possible to calculate the expected size (in pixels) of a face in the resulting image data, leading again to a sub-range of sizes for search or a weighting of the expected face sizes.
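As a rough, hypothetical illustration only (not the formula used by the apparatus), a simple pinhole-camera relation could convert the focus distance and focal length into an expected face height in pixels; the sensor height, image height and average face height below are assumed example values.

    # Rough pinhole estimate: face height on the sensor is approximately
    # focal_length * real_face_height / subject_distance, then converted to pixels.
    def expected_face_height_px(subject_distance_m, focal_length_mm,
                                sensor_height_mm=5.0, image_height_px=576,
                                average_face_height_m=0.25):
        face_on_sensor_mm = focal_length_mm * average_face_height_m / subject_distance_m
        return face_on_sensor_mm * image_height_px / sensor_height_mm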

This arrangement lends itself to use in a video conferencing or so-called digital signage environment.

In a video conferencing arrangement the user could classify the video material as "individual speaker", "Group of two", "Group of three" etc, and based on this classification a face detector could derive an expected face size and could search for and highlight the one or more faces in the image.

In a digital signage environment, advertising material could be displayed on a video screen. Face detection is used to detect the faces of people looking at the advertising material.

Advantages of the Tracking Algorithm

The face tracking technique has three main benefits:

-   It allows missed faces to be filled in by using Kalman filtering and skin colour tracking in frames for which no face detection results are available. This increases the true acceptance rate across the image sequence.
-   It provides face linking: by successfully tracking a face, the algorithm automatically knows whether a face detected in a future frame belongs to the same person or a different person. Thus, scene metadata can easily be generated from this algorithm, comprising the number of faces in the scene, the frames for which they are present and providing a representative mugshot of each face.
-   False face detections tend to be rejected, as such detections tend not to carry forward between images.

FIGS. 19 a to 19 c schematically illustrate the use of face tracking when applied to a video scene.

In particular, FIG. 19 a schematically illustrates a video scene 800 comprising successive video images (e.g. fields or frames) 810.

In this example, the images 810 contain one or more faces. In particular, all of the images 810 in the scene include a face A, shown at an upper left-hand position within the schematic representation of the image 810. Also, some of the images include a face B shown schematically at a lower right-hand position within the schematic representations of the images 810.

A face tracking process is applied to the scene of FIG. 19 a. Face A is tracked reasonably successfully throughout the scene. In one image 820 the face is not tracked by a direct detection, but the skin colour matching techniques and the Kalman filtering techniques described above mean that the detection can be continuous either side of the "missing" image 820. The representation of FIG. 19 b indicates the detected probability of a face being present in each of the images. It can be seen that the probability is highest at an image 830, and so the part 840 of the image detected to contain face A is used as a "picture stamp" in respect of face A. Picture stamps will be described in more detail below.

Similarly, face B is detected with different levels of confidence, but an image 850 gives rise to the highest detected probability of face B being present. Accordingly, the part of the corresponding image detected to contain face B (part 860) is used as a picture stamp for face B within that scene. (Alternatively, of course, a wider section of the image, or even the whole image, could be used as the picture stamp).

For each tracked face, a single representative face picture stamp is required. Outputting the face picture stamp based purely on face probability does not always give the best quality of picture stamp. To get the best picture quality it would be better to bias or steer the selection decision towards faces that are detected at the same resolution as the picture stamp, e.g. 64×64 pixels.

To get the best quality picture stamps the following scheme may be applied:

(1) Use a face that was detected (not colour tracked/Kalman tracked)

(2) Use a face that gave a high probability during face detection, i.e. at least a threshold probability

(3) Use a face which is as close as possible to 64×64 pixels, to reduce rescaling artefacts and improve picture quality

(4) Do not (if possible) use a very early face in the track, i.e. a face in a predetermined initial portion of the tracked sequence (e.g. 10% of the tracked sequence, or 20 frames, etc) in case this means that the face is still very distant (i.e. small) and blurry

Some rules that could achieve this are as follows:

For each face detection:

Calculate the metric M=face_probability*size_weighting, where size_weighting=MIN((face_size/64)^x, (64/face_size)^x) and x=0.25. Then take the face picture stamp for which M is largest.

This gives the following weightings on the face probability for each face size:

    face_size    size_weighting
    16           0.71
    19           0.74
    23           0.77
    27           0.81
    32           0.84
    38           0.88
    45           0.92
    54           0.96
    64           1.00
    76           0.96
    91           0.92
    108          0.88
    128          0.84
    152          0.81
    181          0.77
    215          0.74
    256          0.71
    304          0.68
    362          0.65
    431          0.62
    512          0.59

In practice this could be done using a look-up table.

To make the weighting function less harsh, a smaller power than 0.25, e.g. x=0.2 or 0.1, could be used.
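Purely as an illustrative sketch of the selection metric above (the data layout of the detections is assumed, not taken from the description):

    # Picture-stamp selection: M = face_probability * size_weighting, with the
    # weighting peaking at 1.0 for faces detected at the target stamp size.
    def size_weighting(face_size, target=64, x=0.25):
        return min((face_size / target) ** x, (target / face_size) ** x)

    def best_picture_stamp(detections):
        """detections: iterable of (face_probability, face_size, image_patch) tuples."""
        return max(detections, key=lambda d: d[0] * size_weighting(d[1]))[2]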

This weighting technique could be applied to the whole face track or just to the first N frames (to apply a weighting against the selection of a poorly-sized face from those N frames). N could for example represent just the first one or two seconds (25-50 frames).

In addition, preference is given to faces that are frontally detected over those that were detected at ±30 degrees (or any other pose).

FIG. 20 schematically illustrates a display screen of a non-linear editing system.

Non-linear editing systems are well established and are generally implemented as software programs running on general purpose computing systems such as the system of FIG. 1. These editing systems allow video, audio and other material to be edited to an output media product in a manner which does not depend on the order in which the individual media items (e.g. video shots) were captured.

The schematic display screen of FIG. 20 includes a viewer area 900, in which video clips may be viewed, a set of clip icons 910, to be described further below, and a "timeline" 920 including representations of edited video shots 930, each shot optionally containing a picture stamp 940 indicative of the content of that shot.

At one level, the face picture stamps derived as described with reference to FIGS. 19 a to 19 c could be used as the picture stamps 940 of each edited shot so, within the edited length of the shot, which may be shorter than the originally captured shot, the picture stamp representing a face detection which resulted in the highest face probability value can be inserted onto the time line to show a representative image from that shot. The probability values may be compared with a threshold, possibly higher than the basic face detection threshold, so that only face detections having a high level of confidence are used to generate picture stamps in this way. If more than one face is detected in the edited shot, the face with the highest probability may be displayed, or alternatively more than one face picture stamp may be displayed on the time line.

Time lines in non-linear editing systems are usually capable of being scaled, so that the length of line corresponding to the full width of the display screen can represent various different time periods in the output media product. So, for example, if a particular boundary between two adjacent shots is being edited to frame accuracy, the time line may be "expanded" so that the width of the display screen represents a relatively short time period in the output media product. On the other hand, for other purposes such as visualising an overview of the output media product, the time line scale may be contracted so that a longer time period may be viewed across the width of the display screen. So, depending on the level of expansion or contraction of the time line scale, there may be less or more screen area available to display each edited shot contributing to the output media product.

In an expanded time line scale, there may well be more than enough room to fit one picture stamp (derived as shown in FIGS. 19 a to 19 c) for each edited shot making up the output media product. However, as the time line scale is contracted, this may no longer be possible. In such cases, the shots may be grouped together into "sequences", where each sequence is such that it is displayed at a display screen size large enough to accommodate a face picture stamp. From within the sequence, then, the face picture stamp having the highest corresponding probability value is selected for display. If no face is detected within a sequence, an arbitrary image, or no image, can be displayed on the timeline.

FIG. 20 also shows schematically two "face timelines" 925, 935. These scale with the "main" timeline 920. Each face timeline relates to a single tracked face, and shows the portions of the output edited sequence containing that tracked face. It is possible that the user may observe that certain faces relate to the same person but have not been associated with one another by the tracking algorithm. The user can "link" these faces by selecting the relevant parts of the face timelines (using a standard Windows™ selection technique for multiple items) and then clicking on a "link" screen button (not shown). The face timelines would then reflect the linkage of the whole group of face detections into one longer tracked face.

FIGS. 21 a and 21 b schematically illustrate two variants of clip icons 910′ and 910″. These are displayed on the display screen of FIG. 20 to allow the user to select individual clips for inclusion in the time line and editing of their start and end positions (in and out points). So, each clip icon represents the whole of a respective clip stored on the system.

In FIG. 21 a, a clip icon 910′ is represented by a single face picture stamp 912 and a text label area 914 which may include, for example, time code information defining the position and length of that clip. In an alternative arrangement shown in FIG. 21 b, more than one face picture stamp 916 may be included by using a multi-part clip icon.

Another possibility for the clip icons 910 is that they provide a "face summary" so that all detected faces are shown as a set of clip icons 910, in the order in which they appear (either in the source material or in the edited output sequence). Again, faces that are the same person but which have not been associated with one another by the tracking algorithm can be linked by the user subjectively observing that they are the same face. The user could select the relevant face clip icons 910 (using a standard Windows™ selection technique for multiple items) and then click on a "link" screen button (not shown). The tracking data would then reflect the linkage of the whole group of face detections into one longer tracked face.

A further possibility is that the clip icons 910 could provide a hyperlink so that the user may click on one of the icons 910 which would then cause the corresponding clip to be played in the viewer area 900.

A similar technique may be used in, for example, a surveillance or closed circuit television (CCTV) system. Whenever a face is tracked, or whenever a face is tracked for at least a predetermined number of frames, an icon similar to a clip icon 910 is generated in respect of the continuous portion of video over which that face was tracked. The icon is displayed in a similar manner to the clip icons in FIG. 20. Clicking on an icon causes the replay (in a window similar to the viewer area 900) of the portion of video over which that particular face was tracked. It will be appreciated that multiple different faces could be tracked in this way, and that the corresponding portions of video could overlap or even completely coincide.

FIGS. 22 a to 22 c schematically illustrate a gradient pre-processing technique.

It has been noted that image windows showing little pixel variation can tend to be detected as faces by a face detection arrangement based on eigenfaces or eigenblocks. Therefore, a pre-processing step is proposed to remove areas of little pixel variation from the face detection process. In the case of a multiple scale system (see above) the pre-processing step can be carried out at each scale.

The basic process is that a "gradient test" is applied to each possible window position across the whole image. A predetermined pixel position for each window position, such as the pixel at or nearest the centre of that window position, is flagged or labelled in dependence on the results of the test applied to that window. If the test shows that a window has little pixel variation, that window position is not used in the face detection process.

A first step is illustrated in FIG. 22 a. This shows a window at an arbitrary window position in the image. As mentioned above, the pre-processing is repeated at each possible window position. Referring to FIG. 22 a, although the gradient pre-processing could be applied to the whole window, it has been found that better results are obtained if the pre-processing is applied to a central area 1000 of the test window 1010.

Referring to FIG. 22 b, a gradient-based measure is derived from the window (or from the central area of the window as shown in FIG. 22 a), which is the average of the absolute differences between all adjacent pixels 1011 in both the horizontal and vertical directions, taken over the window. Each window centre position is labelled with this gradient-based measure to produce a gradient "map" of the image. The resulting gradient map is then compared with a threshold gradient value. Any window positions for which the gradient-based measure lies below the threshold gradient value are excluded from the face detection process in respect of that image.
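The following sketch is illustrative only; the window is assumed to be a 2-D array of luminance values (typically the central area 1000 of the test window).

    import numpy as np

    # Gradient-based measure: the mean absolute difference between horizontally
    # and vertically adjacent pixels, taken over the window.
    def gradient_measure(window):
        horizontal = np.abs(np.diff(window, axis=1))
        vertical = np.abs(np.diff(window, axis=0))
        return (horizontal.sum() + vertical.sum()) / (horizontal.size + vertical.size)

    def passes_lower_gradient_test(window, threshold):
        """True if the window shows enough pixel variation to be passed to face detection."""
        return gradient_measure(window) >= threshold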

Alternative gradient-based measures could be used, such as the pixel variance or the mean absolute pixel difference from a mean pixel value.

The gradient-based measure is preferably carried out in respect of pixel luminance values, but could of course be applied to other image components of a colour image.

FIG. 22 c schematically illustrates a gradient map derived from an example image. Here a lower gradient area 1070 (shown shaded) is excluded from face detection, and only a higher gradient area 1080 is used.

The embodiments described above have related to a face detection system (involving training and detection phases) and possible uses for it in a camera-recorder and an editing system. It will be appreciated that there are many other possible uses of such techniques, for example (and not limited to) security surveillance systems, media handling in general (such as video tape recorder controllers), video conferencing systems and the like.

In other embodiments, window positions having high pixel differences can also be flagged or labelled, and are also excluded from the face detection process. A "high" pixel difference means that the measure described above with respect to FIG. 22 b exceeds an upper threshold value.

So, a gradient map is produced as described above. Any positions for which the gradient measure is lower than the (first) threshold gradient value mentioned earlier are excluded from face detection processing, as are any positions for which the gradient measure is higher than the upper threshold value.
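Continuing the earlier sketch, the combined test might simply check the same gradient measure against both thresholds (again illustrative only):

    # Exclude window positions whose gradient measure is too low or too high.
    def passes_variation_tests(window, lower_threshold, upper_threshold):
        g = gradient_measure(window)
        return lower_threshold <= g <= upper_threshold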

It was mentioned above that the "lower threshold" processing is preferably applied to a central part 1000 of the test window 1010. The same can apply to the "upper threshold" processing. This would mean that only a single gradient measure needs to be derived in respect of each window position. Alternatively, if the whole window is used in respect of the lower threshold test, the whole window can similarly be used in respect of the upper threshold test. Again, only a single gradient measure needs to be derived for each window position. Of course, however, it is possible to use two different arrangements, so that (for example) a central part 1000 of the test window 1010 is used to derive the gradient measure for the lower threshold test, but the full test window is used in respect of the upper threshold test.

A further criterion for rejecting a face track, mentioned earlier, is that its variance or gradient measure is very low or very high.

In this technique a tracked face position is validated against variance values from an area of interest map. Only a face-sized area of the map at the detected scale is stored per face for the next iteration of tracking.

Despite the gradient pre-processing described above, it is still possible for a skin colour tracked or Kalman predicted face to move into a (non-face-like) low or high variance area of the image. So, during gradient pre-processing, the variance values (or gradient values) for the areas around existing face tracks are stored.

When the final decision on the face's next position is made (with any acceptance type, either face detection, skin colour or Kalman prediction) the position is validated against the stored variance (or gradient) values in the area of interest map. If the position is found to have very high or very low variance (or gradient), it is considered to be non-face-like and the face track is terminated. This prevents face tracks from wandering onto low (or high) variance background areas of the image.

Alternatively, even if gradient pre-processing is not used, the variance of the new face position can be calculated afresh. In either case the variance measure used can either be traditional variance or the sum of differences of neighbouring pixels (gradient) or any other variance-type measure.

FIG. 23 schematically illustrates a video conferencing system. Two video conferencing stations 1100, 1110 are connected by a network connection 1120 such as: the Internet, a local or wide area network, a telephone line, a high bit rate leased line, an ISDN line etc. Each of the stations comprises, in simple terms, a camera and associated sending apparatus 1130 and a display and associated receiving apparatus 1140. Participants in the video conference are viewed by the camera at their respective station and their voices are picked up by one or more microphones (not shown in FIG. 23) at that station. The audio and video information is transmitted via the network 1120 to the receiver 1140 at the other station. Here, images captured by the camera are displayed and the participants' voices are produced on a loudspeaker or the like.

It will be appreciated that more than two stations may be involved in the video conference, although the discussion here will be limited to two stations for simplicity.

FIG. 24 schematically illustrates one channel, being the connection of one camera/sending apparatus to one display/receiving apparatus.

At the camera/sending apparatus, there is provided a video camera 1150, a face detector 1160 using the techniques described above, an image processor 1170 and a data formatter and transmitter 1180. A microphone 1190 detects the participants' voices.

Audio, video and (optionally) metadata signals are transmitted from the formatter and transmitter 1180, via the network connection 1120, to the display/receiving apparatus 1140. Optionally, control signals are received via the network connection 1120 from the display/receiving apparatus 1140.

At the display/receiving apparatus, there is provided a display and display processor 1200, for example a display screen and associated electronics, user controls 1210 and an audio output arrangement 1220 such as a digital to analogue converter (DAC), an amplifier and a loudspeaker.

In general terms, the face detector 1160 detects (and optionally tracks) faces in the captured images from the camera 1150. The face detections are passed as control signals to the image processor 1170. The image processor can act in various different ways, which will be described below, but fundamentally the image processor 1170 alters the images captured by the camera 1150 before they are transmitted via the network 1120. A significant purpose behind this is to make better use of the available bandwidth or bit rate which can be carried by the network connection 1120. Here it is noted that in most commercial applications, the cost of a network connection 1120 suitable for video conference purposes increases with an increasing bit rate requirement. At the formatter and transmitter 1180 the images from the image processor 1170 are combined with audio signals from the microphone 1190 (for example, having been converted via an analogue to digital converter (ADC)) and optionally metadata defining the nature of the processing carried out by the image processor 1170.

Various modes of operation of the video conferencing system will be described below.

FIG. 25 is a further schematic representation of the video conferencing system. Here, the functionality of the face detector 1160, the image processor 1170, the formatter and transmitter 1180 and the processor aspects of the display and display processor 1200 are carried out by programmable personal computers 1230. The schematic displays shown on the display screens (part of 1200) represent one possible mode of video conferencing using face detection which will be described below with reference to FIG. 31, namely that only those image portions containing faces are transmitted from one location to the other, and are then displayed in a tiled or mosaic form at the other location. As mentioned, this mode of operation will be discussed below.

FIG. 26 is a flowchart schematically illustrating a mode of operation of the system of FIGS. 23 to 25. The flowcharts of FIGS. 26, 28, 31, 33 and 34 are divided into operations carried out at the camera/sender end (1130) and those carried out at the display/receiver end (1140).

So, referring to FIG. 26, the camera 1150 captures images at a step 1300. At a step 1310, the face detector 1160 detects faces in the captured images. Ideally, face tracking (as described above) is used to avoid any spurious interruptions in the face detection and to provide that a particular person's face is treated in the same way throughout the video conferencing session.

At a step 1320, the image processor 1170 crops the captured images in response to the face detection information. This may be done as follows:

-   first, identify the upper left-most face detected by the face detector 1160
-   detect the upper left-most extreme of that face; this forms the upper left corner of the cropped image
-   repeat for the lower right-most face and the lower right-most extreme of that face to form the lower right corner of the cropped image
-   crop the image in a rectangular shape based on these two co-ordinates.

The cropped image is then transmitted by the formatter and transmitter 1180. In this instance, there is no need to transmit additional metadata. The cropping of the image allows either a reduction in bit rate compared to the full image or an improvement in transmission quality while maintaining the same bit rate.
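As a hedged illustration of this cropping step (a simplified variant that takes the bounding rectangle of all detected faces; the (x, y, width, height) box format and NumPy-style image indexing are assumptions):

    # Crop the captured image to the rectangle bounded by the extremes of the
    # detected faces.
    def crop_to_faces(image, face_boxes):
        left = min(x for x, y, w, h in face_boxes)
        top = min(y for x, y, w, h in face_boxes)
        right = max(x + w for x, y, w, h in face_boxes)
        bottom = max(y + h for x, y, w, h in face_boxes)
        return image[top:bottom, left:right]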

At the receiver, the cropped image is displayed as a full screen display at a step 1330.

Optionally, a user control 1210 can toggle the image processor 1170 between a mode in which the image is cropped and a mode in which it is not cropped. This can allow the participants at the receiver end to see either the whole room or just the face-related parts of the image.

Another technique for cropping the image is as follows:

-   identify the leftmost and rightmost faces
-   maintaining the aspect ratio of the shot, locate the faces in the upper half of the picture.

In an alternative to cropping, the camera could be zoomed so that the detected faces are featured more significantly in the transmitted images. This could, for example, be combined with a bit rate reduction technique on the resulting image. To achieve this, a control of the directional (pan/tilt) and lens zoom properties of the camera is made available to the image processor (represented by a dotted line 1155 in FIG. 24).

FIGS. 27 a and 27 b are example images relating to the flowchart of FIG. 26. FIG. 27 a represents a full screen image as captured by the camera 1150, whereas FIG. 27 b represents a zoomed version of that image.

FIG. 28 is a flowchart schematically illustrating another mode of operation of the system of FIGS. 23 to 25. Step 1300 is the same as that shown in FIG. 26.

At a step 1340, each face in the captured images is identified and highlighted, for example by drawing a box around that face for display. Each face is also labelled, for example with an arbitrary label a, b, c . . . . Here, face tracking is particularly useful to avoid any subsequent confusion over the labels. The labelled image is formatted and transmitted to the receiver where it is displayed at a step 1350. At a step 1360, the user selects a face to be displayed, for example by typing the label relating to that face. The selection is passed as control data back to the image processor 1170 which isolates the required face at a step 1370. The required face is transmitted to the receiver. At a step 1380 the required face is displayed. The user is able to select a different face by the step 1360 to replace the currently displayed face. Again, this arrangement allows a potential saving in bandwidth, in that the selection screen may be transmitted at a lower bit rate because it is only used for selecting a face to be displayed. Alternatively, as before, the individual faces, once selected, can be transmitted at an enhanced bit rate to achieve a better quality image.

FIG. 29 is an example image relating to the flowchart of FIG. 28. Here, three faces have been identified, and are labelled a, b and c. By typing one of those three letters into the user controls 1210, the user can select one of those faces for a full-screen display. This can be achieved by a cropping of the main image or by the camera zooming onto that face as described above. FIG. 30 shows an alternative representation, in which so-called thumbnail images of each face are displayed as a menu for selection at the receiver.

FIG. 31 is a flowchart schematically illustrating a further mode of operation of the system of FIGS. 23 to 25. The steps 1300 and 1310 correspond to those of FIG. 26.

At a step 1400, the image processor 1170 and the formatter and transmitter 1180 co-operate to transmit only thumbnail images relating to the captured faces. These are displayed as a menu or mosaic of faces at the receiver end at a step 1410. At a step 1420, optionally, the user can select just one face for enlarged display. This may involve keeping the other faces displayed in a smaller format on the same screen, or the other faces may be hidden while the enlarged display is used. So a difference between this arrangement and that of FIG. 28 is that thumbnail images relating to all of the faces are transmitted to the receiver, and the selection is made at the receiver end as to how the thumbnails are to be displayed.

FIG. 32 is an example image relating to the flowchart of FIG. 31. Here, an initial screen could show three thumbnails 1430, but the stage illustrated by FIG. 32 is that the face belonging to participant c has been selected for enlarged display on a left hand part of the display screen. However, the thumbnails relating to the other participants are retained so that the user can make a sensible selection of a next face to be displayed in enlarged form.

It should be noted that, at least in a system where the main image is cropped, the thumbnail images referred to in these examples are "live" thumbnail images, albeit taking into account any processing delays present in the system. That is to say, the thumbnail images vary in time, as the captured images of the participants vary. In a system using a camera zoom, the thumbnails could be static, or a second camera could be used to capture the wider angle scene.

FIG. 33 is a flowchart schematically illustrating a further mode of operation. Here, the steps 1300 and 1310 correspond to those of FIG. 26.

At a step 1440 a thumbnail face image relating to the face detected to be nearest to an active microphone is transmitted. Of course, this relies on having more than one microphone and also a pre-selection or metadata defining which participant is sitting near to which microphone. This can be set up in advance by a simple menu-driven table entry by the users at each video conferencing station. The active microphone is considered to be the microphone having the greatest magnitude audio signal averaged over a certain time (such as one second). A low pass filtering arrangement can be used to avoid changing the active microphone too often, for example in response to a cough or an object being dropped, or two participants speaking at the same time.
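A minimal sketch of this active-microphone selection follows, assuming per-microphone blocks of audio samples of roughly one second and a simple hold count in place of the low pass filter (all values are illustrative):

    import numpy as np

    # Pick the microphone with the greatest average signal magnitude, requiring
    # a new microphone to stay loudest for several blocks before switching.
    class ActiveMicSelector:
        def __init__(self, hold_blocks=5):
            self.current = 0
            self.candidate = 0
            self.count = 0
            self.hold_blocks = hold_blocks

        def update(self, blocks):
            """blocks: one 1-D array of samples per microphone (about 1 s each)."""
            energy = [np.mean(np.abs(b)) for b in blocks]
            loudest = int(np.argmax(energy))
            if loudest == self.current:
                self.count = 0
            elif loudest == self.candidate:
                self.count += 1
                if self.count >= self.hold_blocks:
                    self.current, self.count = loudest, 0
            else:
                self.candidate, self.count = loudest, 1
            return self.current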

At a step 1450 the transmitted face is displayed. A step 1460 represents the quasi-continuous detection of a current active microphone.

The detection could be, for example, a detection of a single active microphone, or alternatively a simple triangulation technique could detect the speaker's position based on multiple microphones.

Finally, FIG. 34 is a flowchart schematically illustrating another mode of operation, again in which the steps 1300 and 1310 correspond to those of FIG. 26.

At a step 1470 the parts of the captured images immediately surrounding each face are transmitted at a higher resolution and the background (other parts of the captured images) is transmitted at a lower resolution. This can achieve a useful saving in bit rate or allow an enhancement of the parts of the image surrounding each face. Optionally, metadata can be transmitted defining the position of each face, or the positions may be derived at the receiver by noting the resolution of different parts of the image.
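Purely for illustration (the scale factor and the use of simple decimation are assumptions; a real system would more likely use the codec's own quantisation controls), a mixed-resolution frame might be assembled like this before encoding:

    import numpy as np

    # Keep face regions at full resolution; send the background at a lower
    # resolution by decimating it and expanding it back to full size.
    def mixed_resolution_frame(image, face_boxes, background_scale=4):
        low = image[::background_scale, ::background_scale]
        background = np.repeat(np.repeat(low, background_scale, axis=0),
                               background_scale, axis=1)[:image.shape[0], :image.shape[1]]
        out = background.copy()
        for x, y, w, h in face_boxes:
            out[y:y + h, x:x + w] = image[y:y + h, x:x + w]
        return out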

At a step 1480, at the receiver end, the image is displayed and the faces are optionally labelled for selection by a user at a step 1490. This selection could cause the selected face to be displayed in a larger format, similar to the arrangement of FIG. 32.

Although the description of FIGS. 23 to 34 has related to video conferencing systems, the same techniques could be applied to, for example, security monitoring (CCTV) systems. Here, a return channel is not normally required, but an arrangement as shown in FIG. 24, where the camera/sender arrangement is provided as a CCTV camera, and the receiver/display arrangement is provided at a monitoring site, could use the same techniques as those described for video conferencing.

It will be appreciated that the embodiments of the invention described above may of course be implemented, at least in part, using software-controlled data processing apparatus. For example, one or more of the components schematically illustrated or described above may be implemented as a software-controlled general purpose data processing device or a bespoke program controlled data processing device such as an application specific integrated circuit, a field programmable gate array or the like. It will be appreciated that a computer program providing such software or program control and a storage, transmission or other providing medium by which such a computer program is stored are envisaged as aspects of the present invention.

The list of references and appendices follow. For the avoidance of doubt, it is noted that the list and the appendices form a part of the present description. These documents are all incorporated by reference.

REFERENCES

-   1. H. Schneiderman and T. Kanade, "A statistical model for 3D object detection applied to faces and cars," IEEE Conference on Computer Vision and Pattern Recognition, 2000.
-   2. H. Schneiderman and T. Kanade, "Probabilistic modelling of local appearance and spatial relationships for object detection," IEEE Conference on Computer Vision and Pattern Recognition, 1998.
-   3. H. Schneiderman, "A statistical approach to 3D object detection applied to faces and cars," PhD thesis, Robotics Institute, Carnegie Mellon University, 2000.
-   4. E. Hjelmas and B. K. Low, "Face Detection: A Survey," Computer Vision and Image Understanding, no. 83, pp. 236-274, 2001.
-   5. M.-H. Yang, D. Kriegman and N. Ahuja, "Detecting Faces in Images: A Survey," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34-58, January 2002.

Appendix A: Training Face Sets

One database consists of many thousand images of subjects standing in front of an indoor background. Another training database used in experimental implementations of the above techniques consists of more than ten thousand eight-bit greyscale images of human heads with views ranging from frontal to left and right profiles. The skilled man will of course understand that various different training sets could be used, optionally being profiled to reflect facial characteristics of a local population.

Appendix B—Eigenblocks

In the eigenface approach to face detection and recognition (References 4 and 5), each m-by-n face image is reordered so that it is represented by a vector of length mn. Each image can then be thought of as a point in mn-dimensional space. A set of images maps to a collection of points in this large space.

Face images, being similar in overall configuration, are not randomly distributed in this mn-dimensional image space and therefore they can be described by a relatively low-dimensional subspace. Using principal component analysis (PCA), the vectors that best account for the distribution of face images within the entire image space can be found. PCA involves determining the principal eigenvectors of the covariance matrix corresponding to the original face images. These vectors define the subspace of face images, often referred to as the face space. Each vector represents an m-by-n image and is a linear combination of the original face images. Because the vectors are the eigenvectors of the covariance matrix corresponding to the original face images, and because they are face-like in appearance, they are often referred to as eigenfaces [4].

When an unknown image is presented, it is projected into the face space. In this way, it is expressed in terms of a weighted sum of eigenfaces.

In the present embodiments, a closely related approach is used, to generate and apply so-called "eigenblocks" or eigenvectors relating to blocks of the face image. A grid of blocks is applied to the face image (in the training set) or the test window (during the detection phase) and an eigenvector-based process, very similar to the eigenface process, is applied at each block position. (Or, in an alternative embodiment to save on data processing, the process is applied once to the group of block positions, producing one set of eigenblocks for use at any block position.) The skilled man will understand that some blocks, such as a central block often representing a nose feature of the image, may be more significant in deciding whether a face is present.

Calculating Eigenblocks

The calculation of eigenblocks involves the following steps:

-   (1). A training set of $N_T$ images is used. These are divided into image blocks each of size m×n. So, for each block position a set of image blocks, one from that position in each image, is obtained: $\{I_o^t\}_{t=1}^{N_T}$.
-   (2). A normalised training set of blocks $\{I^t\}_{t=1}^{N_T}$ is calculated as follows:

Each image block, $I_o^t$, from the original training set is normalised to have a mean of zero and an L2-norm of 1, to produce a respective normalised image block, $I^t$. For each image block, $I_o^t$, $t=1..N_T$:

$$I^{t} = \frac{I_{o}^{t} - \mathrm{mean\_I}_{o}^{t}}{\left\| I_{o}^{t} - \mathrm{mean\_I}_{o}^{t} \right\|}$$

$$\text{where}\quad \mathrm{mean\_I}_{o}^{t} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} I_{o}^{t}\left[ i,j \right] \quad\text{and}\quad \left\| I_{o}^{t} - \mathrm{mean\_I}_{o}^{t} \right\| = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}\left( I_{o}^{t}\left[ i,j \right] - \mathrm{mean\_I}_{o}^{t} \right)^{2}}$$

(i.e. the L2-norm of $(I_o^t - \mathrm{mean\_I}_o^t)$)
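A small sketch of this normalisation step, assuming the blocks are held as NumPy arrays:

    import numpy as np

    # Normalise an image block to zero mean and unit L2-norm, as in step (2).
    def normalise_block(block):
        centred = block.astype(float) - block.mean()
        return centred / np.sqrt(np.sum(centred ** 2))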

-   (3). A training set of vectors $\{x^t\}_{t=1}^{N_T}$ is formed by lexicographic reordering of the pixel elements of each image block, $I^t$, i.e. each m-by-n image block, $I^t$, is reordered into a vector, $x^t$, of length N=mn.
-   (4). The set of deviation vectors, $D=\{x^t\}_{t=1}^{N_T}$, is calculated. D has N rows and $N_T$ columns.
-   (5). The covariance matrix, Σ, is calculated: $\Sigma = DD^T$

Σ is a symmetric matrix of size N×N.

-   (7). The whole set of eigenvectors, P, and eigenvalues, $\lambda_i$, i=1, . . . , N, of the covariance matrix, Σ, are given by solving: $\Lambda = P^T \Sigma P$

Here, Λ is an N×N diagonal matrix with the eigenvalues, $\lambda_i$, along its diagonal (in order of magnitude) and P is an N×N matrix containing the set of N eigenvectors, each of length N. This decomposition is also known as a Karhunen-Loeve Transform (KLT).
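Assuming the normalised blocks are available, steps (3) to (7) might be sketched as follows (illustrative only; for large N a real implementation would use a more economical decomposition):

    import numpy as np

    # Compute the M principal eigenblocks from a list of normalised m-by-n blocks.
    def compute_eigenblocks(blocks, M):
        D = np.stack([b.ravel() for b in blocks], axis=1)   # N rows, N_T columns
        covariance = D @ D.T                                 # covariance matrix, size N x N
        eigenvalues, eigenvectors = np.linalg.eigh(covariance)
        order = np.argsort(eigenvalues)[::-1]                # largest eigenvalues first
        return eigenvectors[:, order[:M]]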

The eigenvectors can be thought of as a set of features that together characterise the variation between the blocks of the face images. They form an orthogonal basis by which any image block can be represented, i.e. in principle any image can be represented without error by a weighted sum of the eigenvectors.

If the number of data points in the image space (the number of training images) is less than the dimension of the space ($N_T < N$), then there will only be $N_T$ meaningful eigenvectors. The remaining eigenvectors will have associated eigenvalues of zero. Hence, because typically $N_T < N$, all eigenvalues for which $i > N_T$ will be zero.

Additionally, because the image blocks in the training set are similar in overall configuration (they are all derived from faces), only some of the remaining eigenvectors will characterise very strong differences between the image blocks. These are the eigenvectors with the largest associated eigenvalues. The other remaining eigenvectors with smaller associated eigenvalues do not characterise such large differences and therefore they are not as useful for detecting or distinguishing between faces.

Therefore, in PCA, only the M principal eigenvectors with the largest magnitude eigenvalues are considered, where $M < N_T$, i.e. a partial KLT is performed. In short, PCA extracts a lower-dimensional subspace of the KLT basis corresponding to the largest magnitude eigenvalues.

Because the principal components describe the strongest variations between the face images, in appearance they may resemble parts of face blocks and are referred to here as eigenblocks. However, the term eigenvectors could equally be used.

Face Detection using Eigenblocks

The similarity of an unknown image to a face, or its faceness, can be measured by determining how well the image is represented by the face space. This process is carried out on a block-by-block basis, using the same grid of blocks as that used in the training process.

The first stage of this process involves projecting the image into the face space.

Projection of an Image into Face Space

Before projecting an image into face space, much the same pre-processing steps are performed on the image as were performed on the training set:

-   (1). A test image block of size m×n is obtained: $I_o$.
-   (2). The original test image block, $I_o$, is normalised to have a mean of zero and an L2-norm of 1, to produce the normalised test image block, I:

$$I = \frac{I_{o} - \mathrm{mean\_I}_{o}}{\left\| I_{o} - \mathrm{mean\_I}_{o} \right\|}$$

$$\text{where}\quad \mathrm{mean\_I}_{o} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} I_{o}\left[ i,j \right] \quad\text{and}\quad \left\| I_{o} - \mathrm{mean\_I}_{o} \right\| = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}\left( I_{o}\left[ i,j \right] - \mathrm{mean\_I}_{o} \right)^{2}}$$

(i.e. the L2-norm of $(I_o - \mathrm{mean\_I}_o)$)

-   (3). The deviation vector is calculated by lexicographic reordering of the pixel elements of the image. The image is reordered into a deviation vector, x, of length N=mn.

After these pre-processing steps, the deviation vector, x, is projected into face space using the following simple step:

-   (4). The projection into face space involves transforming the deviation vector, x, into its eigenblock components. This involves a simple multiplication by the M principal eigenvectors (the eigenblocks), $P_i$, i=1, . . . , M. Each weight $y_i$ is obtained as follows: $y_i = P_i^T x$, where $P_i$ is the $i^{th}$ eigenvector.

The weights $y_i$, i=1, . . . , M, describe the contribution of each eigenblock in representing the input face block.

Blocks of similar appearance will have similar sets of weights, while blocks of different appearance will have different sets of weights. Therefore, the weights are used here as feature vectors for classifying face blocks during face detection.
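A short sketch of the projection, reusing the normalise_block helper from the earlier sketch (illustrative only):

    # Project a normalised test block onto the M principal eigenblocks to obtain
    # its weight (feature) vector y, where y_i = P_i^T x.
    def project_into_face_space(test_block, eigenblocks):
        x = normalise_block(test_block).ravel()   # deviation vector of length N = m*n
        return eigenblocks.T @ x                  # weights y_1 .. y_M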

1. Face detection apparatus in which an image region of a test image is compared with data indicative of the presence of a face; the apparatus comprising: a pre-processor operable to identify low-difference regions of the test image where there exists less than a threshold image difference across groups of pixels within those regions; and a face detector operable to perform face detection on regions of the test image other than those identified by the pre-processor as low-difference regions.

2. Apparatus according to claim 1, in which the region is a rectangular region; the pre-processor operating to identify low-difference regions only with respect to pixels in a central portion of the regions.

3. Apparatus according to claim 2, in which the central portion of a region comprises all of the region except for two strips, one at each side of the region.

4. Apparatus according to claim 1, in which the pre-processor is operable to identify high-difference regions of the test image where there exists greater than a threshold image difference across groups of pixels within those regions; and a face detector operable to perform face detection on regions of the test image other than those identified by the pre-processor as low-difference regions or high-difference regions.

5. Apparatus according to claim 1, in which the face detector is operable: to derive a set of attributes from respective blocks of a region; to compare the derived attributes with attributes indicative of the presence of a face; to derive a probability of the presence of a face by a similarity between the derived attributes and the attributes indicative of the presence of a face; and to compare the probability with a threshold probability.

6. Apparatus according to claim 5, in which the attributes comprise the projections of image areas onto one or more image eigenvectors.

7. Apparatus according to claim 1, in which the groups of pixels comprise pairs of adjacent pixels.

8. Video conferencing apparatus comprising apparatus according to claim 1.

9. Surveillance apparatus comprising apparatus according to claim 1.

10. A method of face detection, in which an image region of a test image is compared with data indicative of the presence of a face; the method comprising the steps of: identifying low-difference regions of the test image where there exists less than a threshold image difference across groups of pixels within those regions; and performing face detection on regions of the test image other than those identified by the pre-processor as low-difference regions.

11. Computer software having program code for carrying out a method according to claim 10.

12. A providing medium for providing program code according to claim 11.

13. A medium according to claim 12, the medium being a storage medium.

14. A medium according to claim 12, the medium being a transmission medium.